Journal of Speech and Hearing Research, Volume 39, 697-713, August 1996



Generalizability of Speechreading 
Performance on Nonsense Syllables, 
Words, and Sentences: Subjects 
With Normal Hearing 



Marilyn E. Demorest 

University of Maryland 
Baltimore County 

Lynne E. Bernstein* 

Center for Auditory 
and Speech Sciences 
Gallaudet University 

Gale P. DeHaven 

University of Maryland 
Baltimore County 



Ninety-six adults with normal hearing viewed three types of recorded speechreading 
materials (consonant-vowel nonsense syllables, isolated words, and sentences) on 2 days. 
Responses to nonsense syllables were scored for syllables correct and syllable groups correct; 
responses to words and sentences were scored in terms of words correct, phonemes correct, 
and an estimate of visual distance between the stimulus and the response. Generalizability 
analysis was used to quantify sources of variability in performance. Subjects and test items 
were important sources of variability for all three types of materials; effects of talker and day of 
testing varied but were comparatively small. For each type of material, alternative models of 
test construction and test-score interpretation were evaluated through estimation of general- 
izability coefficients as a function of test length. Performance on nonsense syllables correlated 
about .50 with both word and sentence measures, whereas correlations between words and 
sentences typically exceeded .80. 

KEY WORDS: speechreading (lipreading), individual differences, assessment, visual 
speech perception, generalizability analysis 



Generalizability theory, developed by Cronbach, Gleser, Nanda, and Rajaratnam 
(1972), is a psychometric theory that provides a framework for analysis of the 
sources of variability in test scores. In a previous study (Demorest & Bernstein, 1992),
generalizability theory was applied to speechreading performance of 104 young
adults with normal hearing who viewed videodisc recordings of 100 CID Everyday
Sentences (Davis & Silverman, 1970) spoken by a male and a female talker. Three 
sources of variability in performance were investigated: talker, sentence, and 
subject. Performance was scored as words correct per sentence. The most 
important systematic sources of variability were found to be the sentence (26.3%), 
the subject (10.5%), the talker (4.9%), and the interaction of talker and sentence 
(5.1%). Other effects were negligible and residual error accounted for 51.2% of the 
variance. Five models for testing speechreading were evaluated and generalizability 
coefficients, which are analogous to reliability coefficients, were estimated. It was 
found that generalizability is highest when all subjects are tested with the same talker 
and the test score is interpreted as a talker-specific score. Generalizability is lowest 
when different subjects are tested with different talkers and the score is presumed to 
generalize across talkers. 

The present study extended the results obtained by Demorest and Bernstein
(1992) in several ways. First, three types of materials were
used to assess speechreading: sentences, isolated mono- 
syllabic words, and nonsense syllables. Sentence materials 
included not only the CID Everyday Sentences, but also an 
additional 100 videorecorded sentences from Bernstein and 
Eberhardt (1986b). Second, the research design was ex- 
panded to include the test occasion as an independent 
variable. This permitted estimation of day-to-day variability 
in performance and derivation of generalizability coefficients 
that reflect that variability. Third, a variety of performance 
metrics was analyzed. Words and sentences were scored 
not only in terms of whole words correct, but also in terms of 
the number of phonemes correct and the visual dissimilarity 
of the stimulus and response. 



Test Materials 

Until the 1950s, research on speechreading was mainly 
concerned with determining the intelligence or cognitive 
abilities of deaf people (Parasnis & Samar, 1982). The most 
popular materials for assessing lipreading were sentences, 
but the most psychometrically sophisticated and most pop- 
ular test of lipreading (Utley, 1946a, 1946b) employed 
words, sentences, and stories. Contemporary investigations 
of speechreading that parallel those in auditory speech 
perception have emphasized identification of nonsense syl- 
lables (e.g., Montgomery & Jackson, 1983; Owens & Blazek, 
1985) and/or words in connected texts — frequently in iso- 
lated sentences (e.g., Walden, Erdman, Montgomery, 
Schwartz, & Prosek, 1981). 

Relatively few studies have employed identification of 
isolated words. Erber and McMahan (1976) showed that 
monosyllabic nouns in isolation were more intelligible than in 
sentences for 13- to 16-year-old speechreaders with profound
hearing losses. This intriguing result, which runs counter to 
classical results in auditory speech perception (e.g., Miller, 
Heise, & Lichten, 1951), appears not to have received 
additional attention. Notwithstanding, Rosen and Corcoran 
(1982) asserted that "results from word lists prove to be poor 
predictors of performance in more natural speech tasks" 
(p. 246). 

Research on spoken word recognition predicts a higher 
level of association between performance on word and 
sentence identification tasks than between nonsense sylla- 
ble and sentence identification tasks. This is, first, because 
word identification fluency has a direct effect on ability to 
process the syntax and meaning of the sentence effectively 
(Duffy & Pisoni, 1992; Stanovich, 1980); and second, be- 
cause experimental results suggest that phoneme identifi- 
cation is not a stage in word recognition (for an excellent 
discussion see Marslen-Wilson & Warren, 1994). By select- 
ing for the current study stimuli at the nonsense syllable, 
word, and sentence levels, it was possible not only to test 
the generalizability of the performance associated with these 
different levels, but to determine the correlations of perfor- 
mance metrics across the three tasks. Strong correlations 
could indicate commonality among processes responsible 
for performance with each level of materials. 



Test Occasion 

Several studies have examined scores across two or more 
presentations of the same speechreading materials (e.g., 
Eberhardt, Bernstein, Demorest, & Goldstein, 1990; Plant & 
MacRae, 1981) and have shown that mean performance 
increased as a function of test occasion. This result can be 
interpreted as learning of the task and/or test materials. 
Other studies have examined whether various types of 
training bring about improved speechreading (e.g., Gesi, 
Massaro, & Cohen, 1992) and have shown some improve- 
ments. Studies such as these concern changes in group 
mean performance over time, with or without training, but 
they do not provide estimates of measurement error that can 
be used to evaluate change over time for individuals. Day- 
to-day variability of performance, in the absence of system- 
atic intervention (e.g., training, feedback, and/or sensory 
aids), provides a basis for such estimates. When subjects' 
day-to-day variability is small, relative to the magnitude of 
individual differences, generalizability over test occasions is 
high, and vice versa. The magnitude of measurement error 
associated with test occasion was examined here by divid- 
ing the total testing time between two sessions separated by 
several days. 

Performance Metrics 

Different performance metrics can be derived from the 
same response. For nonsense syllables, one can count 
syllables correct, or can score more leniently in terms of 
putatively homophenous (visually indistinct) groups of pho- 
nemes. There is also more potential information in re- 
sponses to words and sentences than is revealed by simply 
counting words correct. In previous work (Bernstein, Demo- 
rest, & Eberhardt, 1994), we developed a sequence compar- 
ator that uses empirical data about the visual confusability of 
individual consonant and vowel phonemes (Eberhardt et al., 
1990; Montgomery & Jackson, 1983) to derive a phoneme- 
by-phoneme alignment between a stimulus and a response. 
Among the performance measures that can be derived from 
alignments are the number of phonemes correct and the 
visual phonetic dissimilarity between the stimulus and re- 
sponse. Visual phonetic dissimilarity is the sum of the 
distances of all the individual phoneme-to-phoneme align- 
ments, where the distance metric is the Euclidean distance 
between phonemes in a multidimensionally scaled percep- 
tual space (see Bernstein et al., 1994). 
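The visual phonetic dissimilarity described above can be illustrated with a short sketch. This is not the sequence comparator of Bernstein et al. (1994): the phoneme coordinates are invented for the example, the alignment is assumed to be given already, and insertions and deletions are ignored.

```python
import math

# Hypothetical 2-D coordinates for a handful of phonemes. These values are
# invented for illustration; they are not the published perceptual space.
COORDS = {
    "p": (0.0, 0.0),
    "b": (0.1, 0.0),
    "f": (1.0, 0.0),
    "t": (2.0, 1.0),
}


def visual_distance(alignment, coords=COORDS):
    """Sum the Euclidean distances over aligned (stimulus, response) pairs."""
    return sum(math.dist(coords[s], coords[r]) for s, r in alignment)


# A three-phoneme alignment: one exact match and two substitutions.
example = [("p", "p"), ("p", "b"), ("f", "t")]
total = visual_distance(example)
# Dividing by the number of stimulus phonemes gives a length-normalized value.
normalized = total / len(example)
```

An exact match contributes zero, so the total grows only with the perceptual distance of each substitution.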

The potential usefulness of these measures for character- 
izing speechreading performance has been illustrated by 
Demorest and Bernstein (1991). Nevertheless, it cannot be 
assumed that alternative measures of performance derived 
from the same response are psychometrically equivalent. 
Therefore, in this study nonsense syllables were scored in 
terms of syllables correct and syllable groups correct, and 
words and sentences were scored in terms of words correct, 
phonemes correct, and visual distance (the measure of 
visual phonetic dissimilarity). 

To summarize, the purpose of the present study was to 
apply generalizability theory to three types of stimulus 
materials used in assessment of speechreading: sentences,
words, and nonsense syllables. The experimental design
also extended our previous work by (a) including test occasion
as a source of variability; (b) describing performance on
words and sentences in terms of words correct, phonemes
correct, and a measure of visual distance between the
stimulus and response; and (c) examining correlations
among materials and methods of scoring performance.

Method 

Subjects 

Subjects were 104 graduate and undergraduate students
from the University of Maryland Baltimore County who were 
paid for their participation. A brief questionnaire was used to 
screen subjects for normal hearing, for normal (or corrected) 
vision, and for English as a native language. During the 
course of the experiment, 8 subjects had to be replaced due 
to attrition, equipment malfunction, or failure to meet selec- 
tion criteria. The remaining 96 subjects ranged in age from 
18 to 45 years (M = 22.6, SD = 6.4), with the majority (81%)
less than 25 years. Twenty-eight subjects (29%) were male. 
None of the subjects reported having had any training in 
speechreading. 

Materials 

All stimulus materials were high-quality videorecordings 
(Bernstein & Eberhardt, 1986a, 1986b) of a male and a 
female talker who spoke General American English. The 
videorecording procedures have been described by Demor- 
est and Bernstein (1992). 

Nonsense syllables. Consonant-vowel (CV) nonsense 
syllables, spoken by both talkers, were taken from the 
Bernstein and Eberhardt (1986a) videodisc. There were two 
tokens from each talker of 22 initial consonants combined
with the vowel /a/, plus two tokens of the vowel /a/ alone.
One token of each syllable was arbitrarily designated Token
1; the other was designated Token 2. The consonants were:
/p, b, m, f, v, θ, ð, w, r, tʃ, dʒ, ʃ, ʒ, t, d, s, z, k, g, n, l, h/.

Words. Monosyllabic words were selected from the Bern- 
stein and Eberhardt (1986a) recordings of the 300 words of 
the clinical version (Kreul, Nixon, Kryter, Bell, & Lamb, 1968) 
of the Modified Rhyme Test (MRT; House, Williams, Hecker, 
& Kryter, 1965). All words were spoken by the male talker. 
The MRT comprises 50 six-word ensembles that are used 
for testing speech in a closed-response format. Four words 
were selected from each of the 50 ensembles so as to yield 
200 different words. Two sets of 100 words were generated 
by assigning two words within each ensemble to Set 1 and 
two words to Set 2. 

Sentences. Previous findings (Demorest & Bernstein, 
1992) indicated that our female talker tends to be more 
difficult to speechread than our male talker. An attempt was 
therefore made to equalize the difficulty of the talkers by 
selecting sentences for the female talker that had yielded 
relatively high mean performance. Fifty CID Everyday Sen- 
tences were selected for each of the two talkers, and an 
additional 50 sentences per talker were taken from Corpus
III and Corpus IV of Bernstein and Eberhardt (1986b); the
latter will be termed B-E Sentences. The B-E Sentences 
were selected from those used in baseline (unaided) testing 
by Eberhardt et al. (1990). The mean number of words 
correct per sentence obtained by the 15 subjects in that 
experiment was used to guide sentence selection. 

Procedures 

General procedures. The subject was seated at a small 
table in a darkened, 10' square, sound attenuating room. 
The same method of stimulus presentation was used for 
nonsense syllables, words, and sentences. A videodisc 
player (Sony Lasermax LDP 1550) was controlled by a 
Compaq Portable computer that was placed on the table. 
The computer was used to instruct the subject during testing 
and to record the subject's responses. A small, high-inten- 
sity lamp illuminated the computer keyboard. Stimuli were 
presented on a 19" high-resolution color monitor (Sony 
Trinitron PVM 1910) placed at a distance of 2 m from the 
subject. The subject initiated presentation of the first stim- 
ulus by pressing a key on the computer keyboard. After a 
brief pause, the first frame of the stimulus was presented for 
2 s, then the remaining frames were played. The final frame 
remained on the screen until the subject's response was 
completed. After a brief pause, during which the monitor 
was cleared, the next stimulus was presented. 

The three types of materials were presented to each 
subject on each of 2 days. Test sessions took 1½ to 2 hours
to complete, and the retest interval ranged from 5 to 12 days
(M = 7.1, SD = 0.9). Because of subject replacement, order 
of presentation of the syllables, words, and sentences was 
not perfectly counterbalanced, but a minimum of 15 sub- 
jects received each possible order. 

Nonsense syllables. On each day subjects were tested 
with two 92-item lists, one for each talker. Each list con- 
sisted of two repetitions of the 44 CV tokens and two 
repetitions of the two /a/ tokens. Item order was
pseudorandomized at the time of list presentation, and talker order
was counterbalanced across subjects. Selected keys of the 
computer's keyboard were labeled with 23 one- and two- 
character phonemic codes that the subject pressed to 
record a response. Before testing began, the response 
codes were explained and clarification was provided if 
necessary. A cue card was available for reference through- 
out the test session. It paired each response code with a 
word that illustrated its pronunciation. On Day 1, the subject 
was given a five-item practice list. 

Words. On each day, subjects were tested with 100 MRT 
words in Set 1 or Set 2. Word order within each set was 
randomized at the time of testing, and set order was 
counterbalanced across subjects. Subjects were informed 
that each word consisted of a single syllable and that each 
word would appear only once. They were instructed to type 
whatever they thought the speaker had said and were given 
as long as necessary to respond. Editing of the response 
was permitted, as were blank responses. A five-word
practice list was given on Day 1.

Prior to data analysis, responses were reviewed for obvious
typographical and spelling errors (e.g., "trian" for "train"
or "whor" for "whore") and were edited if there was no
ambiguity concerning the intended response. Responses 
were also edited if they were homophones of the stimulus 
word (e.g., "piece" was replaced with "peace"). Responses 
were transcribed automatically into a phonemic notation 
prior to analysis by the sequence comparator (see Bernstein 
et al., 1994). 

Sentences. On each day subjects viewed four sets of 25 
sentences: 25 CID sentences and 25 B-E Sentences for 
each talker. Talker order, the order of CID versus B-E 
Sentences, and the specific sets viewed on a given day were 
counterbalanced across subjects. 

Subjects were informed that they would see a series of 
unrelated sentences and were instructed to type exactly 
what they thought the talker said. Partial responses, includ- 
ing word fragments, were encouraged, but subjects were 
instructed to keep all portions of their response in chrono- 
logical order and to type their response in lower case. 
Subjects were also instructed to use Standard English 
spelling and to use contractions only if that is what they 
thought the speaker actually said. As in the MRT word task, 
editing of the response was possible until the enter key was 
pressed. Following each response, the subject was asked to 
provide a confidence rating on a scale ranging from 0 to 7. 
Confidence ratings were collected as part of a related 
research project (Demorest & Bernstein, 1994) and are not
discussed further in this report. A three-sentence practice
list was given on Day 1.

The same editing procedures used with MRT words were 
used for responses to sentences. Additional editing was 
performed so that the response would conform to the format 
specified by an automated word-scoring program that 
counts words correct. For example, punctuation was added 
for contractions, and numerals and number words were 
interchanged so as to match the form of the stimulus script. 
Finally, the edited responses were phonemically transcribed. 



Measures 

Nonsense syllables. Performance on individual syllables 
was scored in two ways. The syllables correct measure was 
scored 1 if the syllable was correct and 0 otherwise. For the 
second measure, syllables were classified into seven groups 
with high mutual confusability based on consonant confu- 
sion matrices derived from the data of Eberhardt et al. 
(1990): /p, b, m/, /f, v/, /θ, ð/, /w, r/, /tʃ, dʒ, ʃ, ʒ/, /t, d, s, z/, and
/k, g, n, l, h, a/. Syllable group was scored 1 if the stimulus
and response phonemes were in the same group and 0 if 
they were not. 
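The two syllable measures can be expressed as a pair of small scoring functions. The sketch below is illustrative only: the ASCII labels ("th", "dh", "ch", "jh", "sh", "zh") are hypothetical stand-ins for the phonemic response codes, which are not listed in the text, and "a" stands for the vowel-alone item.

```python
# Seven response groups, written with hypothetical ASCII labels:
# "th" = /θ/, "dh" = /ð/, "ch" = /tʃ/, "jh" = /dʒ/, "sh" = /ʃ/, "zh" = /ʒ/.
GROUPS = [
    {"p", "b", "m"},
    {"f", "v"},
    {"th", "dh"},
    {"w", "r"},
    {"ch", "jh", "sh", "zh"},
    {"t", "d", "s", "z"},
    {"k", "g", "n", "l", "h", "a"},
]

# Map each of the 23 response alternatives to the index of its group.
GROUP_OF = {ph: i for i, grp in enumerate(GROUPS) for ph in grp}


def score_syllable(stimulus, response):
    """Strict scoring: 1 if the response equals the stimulus, else 0."""
    return int(stimulus == response)


def score_group(stimulus, response):
    """Lenient scoring: 1 if stimulus and response share a group, else 0."""
    return int(GROUP_OF[stimulus] == GROUP_OF[response])
```

For example, responding "m" to the stimulus "p" scores 0 for syllables correct but 1 for syllable group.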

Words. Performance on each MRT word was scored in 
terms of whole words correct (WC), phonemes correct, and 
visual distance. Phonemes correct and visual distance were 
obtained by submitting each transcribed response to the 
sequence comparator. Nonresponses were scored as zero
phonemes correct and assigned a visual distance consistent 
with deletion of all stimulus phonemes. Stimulus length 
ranged from three to five phonemes, so phonemes correct 
and visual distance were each divided by stimulus length to 
normalize the measures; these are termed phonemes correct
normalized on stimulus length (PCNS) and visual distance
normalized on stimulus length (VDNS), respectively.

Sentences. Each sentence was scored in terms of three 
performance measures. The total number of words correct 
in the sentence was determined by a computer-scoring 
algorithm that matched words in the stimulus with words in 
the response. Word order was taken into account, and 
contractions were treated as two words. Because sentences 
varied from 2 to 15 words in length, the performance 
measure was words correct normalized on stimulus length 
(WCNS). 
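A length-normalized words-correct score of this kind might be computed as follows. The scoring program itself is not described in detail, so this sketch substitutes a generic in-order matcher from the Python standard library and assumes that contractions have already been split into two words by the caller; the function name is hypothetical.

```python
from difflib import SequenceMatcher


def words_correct_normalized(stimulus, response):
    """Proportion of stimulus words matched, in order, by the response."""
    stim_words = stimulus.lower().split()
    resp_words = response.lower().split()
    # Matching blocks respect word order, so a transposed word is not
    # counted twice. autojunk is disabled so no word is treated as noise.
    matcher = SequenceMatcher(None, stim_words, resp_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(stim_words)


score = words_correct_normalized("the dog ran home", "a dog went home")
```

Dividing by the number of stimulus words keeps scores comparable across sentences of 2 to 15 words.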

As with the MRT words, the sequence comparator was 
used to obtain two additional measures: phonemes correct 
normalized on stimulus length in phonemes (PCNS) and 
visual distance normalized on stimulus length in phonemes 
(VDNS). Nonresponses were scored as zero phonemes 
correct and were assigned a visual distance consistent with 
deletion of all stimulus phonemes. 

Generalizability Analysis 

In generalizability theory, an observed score is modeled as 
the sum of various fixed and random effects that reflect 
conditions of testing (e.g., day, talker, or test list). The 
interpretation of a given subject's observed score, however, 
typically involves a universe of generalization across those 
conditions that are deemed irrelevant to the purposes of 
testing. A subject's universe score is defined as the ex- 
pected value of the observed score across those conditions. 
Observed-score variance is partitioned into universe-score 
variance and variance due to the irrelevant effects across 
which generalization occurs. 

Generalizability theory permits derivation of important 
practical implications for test development, use, and inter- 
pretation. Given a proposed testing protocol and a definition 
of the universe of generalization, it is possible to derive 
expressions for expected observed-score variance and uni- 
verse-score variance, respectively. The ratio of universe-score 
variance to observed-score variance is called a generalizability 
coefficient (symbolized $\rho^2$). It is directly analogous to a reliability
coefficient as defined in classical test theory (Allen & Yen,
1979; Crocker & Algina, 1986). 

In the results below, expressions for and interpretations of 
generalizability coefficients are given in more detail. For each 
type of test material, a structural model for observed scores 
is specified and the variance accounted for by each effect in 
the model is estimated. (Examples of the estimation proce- 
dures can be found in Demorest & Bernstein [1991, 1992].) 
Formulas for the generalizability coefficients are presented, 
and empirical estimates of generalizability are obtained for 
models of testing and test-score interpretation that might be 
implemented in a clinical or research setting. 

Results and Discussion 

Nonsense Syllables 

Descriptive statistics. Means and standard deviations
for the two measures of performance on CV syllables are
shown in Table 1 as a function of talker and day. Overall,
31.2% of the syllables were correctly identified, and 81.2%
of the responses were in the correct syllable group. Observed
means were higher for Talker 2 and for Day 2 on both
measures, but the talker difference for syllables correct was
not significant.

TABLE 1. Descriptive statistics for two measures of speechreading
performance on nonsense syllables as a function of
talker and day (N = 96).

                  Syllables correct     Syllable groups correct
                  M        SD           M        SD
Talker 1          .294     .052         .777     .077
  Day 1           .286     .065         .766     .088
  Day 2           .302     .064         .787     .081
Talker 2          .329     .056         .847     .096
  Day 1           .321     .063         .829     .109
  Day 2           .337     .072         .865     .100
Both talkers      .312     .045         .812     .079
  Day 1           .304     .050         .798     .086
  Day 2           .320     .056         .826     .081

Structural model. For each performance measure, the 
unit of observation was defined as the average across a 
single token of the 23 syllables for a given talker on a given 
day. Each subject therefore had eight observed scores (2
tokens × 2 talkers × 2 days). Talker (T) was a fixed
within-subjects factor, and token (To) and day (D) were
random within-subjects factors. Token was nested within
talker; the remaining factors were crossed. It was assumed
that the highest-order interaction, Subject × Day × Token,
was zero. Preliminary analysis revealed that only one of the
12 effects involving talker order was significant and that it
accounted for only 0.01% of the variance. Talker order was
therefore not included in the structural model.

Estimation of variance components. The estimated 
variance attributable to each effect in the structural model is 
shown in Table 2. When performance was scored strictly in 
terms of syllables correct, the largest systematic source of 
variance was subjects, but it accounted for only 15.9% of 
the total variance. Residual error was the largest component 
of variance, accounting for 72.8% of the total. In contrast, 
when performance was scored more leniently, in terms of 
syllable groups correct, individual differences among sub- 
jects accounted for the greatest percentage of variance 
(45.8%), and residual error accounted for only 27.8% of the 
total. The systematic nature of visual-phonemic confusions 
is reflected in the syllable-group scores, which vary more 
from person to person than do syllables correct. 
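The percentages quoted above follow from dividing each estimated variance component by the sum of all components. The sketch below recomputes them from the rounded estimates reported in Table 2; the dictionary keys are shorthand chosen for this example, and small discrepancies from the published percentages reflect rounding of the inputs.

```python
# Rounded variance estimates from Table 2 (negative estimates already set to 0).
SYLLABLES_CORRECT = {
    "S": 0.00140, "D": 0.00012, "T": 0.00024, "To/T": 0.00021,
    "SxD": 0.0, "SxT": 0.00042, "SxTo/T": 0.0, "DxT": 0.0,
    "DxTo/T": 0.0, "SxDxT": 0.0, "Error": 0.00640,
}
SYLLABLE_GROUPS = {
    "S": 0.00544, "D": 0.00039, "T": 0.00120, "To/T": 0.00006,
    "SxD": 0.00004, "SxT": 0.00051, "SxTo/T": 0.00030, "DxT": 0.00002,
    "DxTo/T": 0.0, "SxDxT": 0.00060, "Error": 0.00330,
}


def percent_of_total(components):
    """Express each variance component as a percentage of their sum."""
    total = sum(components.values())
    return {name: 100.0 * value / total for name, value in components.items()}


pct_correct = percent_of_total(SYLLABLES_CORRECT)
pct_groups = percent_of_total(SYLLABLE_GROUPS)
```

With these inputs the subject component is close to 15.9% for syllables correct and close to 45.8% for syllable groups correct.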

Implications for test development. Typical consider- 
ations in testing with nonsense syllables concern the num- 
ber of tokens per syllable and the number of repetitions of 
each token. It is assumed here that the minimum "test" 
consists of presenting a single token of each syllable by a 
given talker on a given day. The model for this score is 
represented as: 



$$X = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 + \alpha_6 + \alpha_7 + \alpha_8 + \alpha_9 + \alpha_{10} + e \tag{1}$$

where the parameters $\alpha_1$ to $\alpha_{10}$ correspond to the effects
listed in Table 2, and $e$ represents residual error. Consider a
testing model (Model 1) in which n repetitions of m tokens by
a given talker are presented on a given day. In this context, 
"day" does not refer to a particular calendar day, but rather 
to the ordinal occasion of testing (i.e., first test occasion, 
second test occasion, etc.). Test score interpretation with 
this model involves generalization over tokens and days, but 
not over talkers. That is, the score is interpreted as a 
talker-specific score, but not a token-specific score nor a 
day-specific score. 
The formula for the generalizability coefficient for Model 1
is:

$$\rho^2 = \frac{\sigma_1^2 + \sigma_6^2}{\sigma_1^2 + \sigma_5^2 + \sigma_6^2 + \dfrac{\sigma_7^2}{m} + \sigma_{10}^2 + \dfrac{\sigma_e^2}{mn}} \tag{2}$$



The numerator represents universe-score variance and the 
denominator represents observed-score variance. Because 
day, talker, and token are held constant, their main effects 
and interactions do not contribute either to universe-score 
variance or observed-score variance. In the numerator, 
universe-score variance consists of the variance components
for subjects and for the Subject × Talker interaction.
This shows clearly that the universe score is talker-specific.
In the denominator, the variance components for Subject ×
Token and residual error are reduced by a factor of m, the
number of tokens, and residual error variance is further
reduced by a factor of n, the number of repetitions of each
token. Estimated generalizability coefficients for Model 1, for
both performance measures, are shown in Table 3 as a 
function of the number of repetitions and the number of 
tokens. 
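The Model 1 coefficient can be evaluated directly from the variance components. The sketch below is one reading of the formula described above: it assumes that the denominator also contains the Subject × Day and Subject × Day × Talker components undivided, an assumption that reproduces the tabled coefficients to within rounding. Only the components that enter the formula are listed, using the rounded Table 2 estimates.

```python
# Rounded Table 2 estimates for the components that enter the coefficient.
SYLLABLES_CORRECT = {
    "S": 0.00140, "SxD": 0.0, "SxT": 0.00042,
    "SxTo/T": 0.0, "SxDxT": 0.0, "Error": 0.00640,
}
SYLLABLE_GROUPS = {
    "S": 0.00544, "SxD": 0.00004, "SxT": 0.00051,
    "SxTo/T": 0.00030, "SxDxT": 0.00060, "Error": 0.00330,
}


def g_model1(vc, m, n):
    """Talker-specific generalizability for m tokens and n repetitions."""
    universe = vc["S"] + vc["SxT"]
    observed = (
        universe
        + vc["SxD"]
        + vc["SxTo/T"] / m      # Subject x Token shrinks with more tokens
        + vc["SxDxT"]
        + vc["Error"] / (m * n)  # residual error shrinks with tokens and repetitions
    )
    return universe / observed


minimum_test_correct = g_model1(SYLLABLES_CORRECT, m=1, n=1)
minimum_test_groups = g_model1(SYLLABLE_GROUPS, m=1, n=1)
```

For the 23-item minimum test this gives roughly .22 for syllables correct and .58 for syllable groups correct.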

The syllables-correct section of the table is symmetrical
because for that dependent variable, the estimated variance
component for Subject × Token is zero. Increasing test
length by adding tokens or repetitions has the same effect
on generalizability, for both act only through the term $\sigma_e^2/mn$.
Interestingly, the asymptotic values for generalizability are
unity, despite the fact that the formula for generalizability
(Equation 2) does not approach one as a limit when m and n
approach infinity. This results from the zero estimates of $\sigma_5^2$
and $\sigma_{10}^2$. The value of .800 is here selected as a minimum
acceptable generalizability. Substitution of various values of
m and n in the estimation formula reveals that it would take
a test 14 times longer than the 23-item minimum test (322
items) to achieve this level. This could be accomplished by
presenting each token seven times, or by holding token
constant and presenting each syllable 14 times.

TABLE 2. Estimated variance components for two measures of
speechreading performance on CV syllables.

                          Syllables correct       Syllable groups correct
                          Variance                Variance
Source                    estimate      %         estimate      %
 1. Subject (S)           .00140       15.9       .00544       45.8
 2. Day (D)               .00012        1.3       .00039        3.3
 3. Talker (T)            .00024        2.8       .00120       10.2
 4. Token/Talker (To/T)   .00021        2.3       .00006        0.5
 5. S × D                 0             0         .00004        0.3
 6. S × T                 .00042        4.8       .00051        4.3
 7. S × To/T              0             0         .00030        2.5
 8. D × T                 0             0         .00002        0.2
 9. D × To/T              0             0         0             0
10. S × D × T             0             0         .00060        5.1
11. Error                 .00640       72.8       .00330       27.8

Note. Negative estimates of variance have been set to zero.

TABLE 3. Generalizability coefficients for two measures of speechreading performance on CV
syllables as a function of number of tokens and number of repetitions (Model 1).

                                    Tokens
Repetitions    1       2       3       4       5       10      ∞

Syllables correct
1              0.222   0.363   0.461   0.533   0.588   0.740   1.000
2              0.363   0.533   0.631   0.695   0.740   0.851   1.000
3              0.461   0.631   0.720   0.774   0.811   0.895   1.000
4              0.533   0.695   0.774   0.820   0.851   0.919   1.000
5              0.588   0.740   0.811   0.851   0.877   0.934   1.000
10             0.740   0.851   0.895   0.919   0.934   0.966   1.000
∞              1.000   1.000   1.000   1.000   1.000   1.000   1.000

Syllable groups correct
1              0.584   0.710   0.764   0.795   0.815   0.857   0.904
2              0.697   0.787   0.822   0.841   0.853   0.878   0.904
3              0.745   0.817   0.844   0.858   0.867   0.885   0.904
4              0.772   0.832   0.855   0.867   0.874   0.888   0.904
5              0.789   0.842   0.862   0.872   0.878   0.891   0.904
10             0.825   0.862   0.876   0.882   0.887   0.895   0.904
∞              0.864   0.883   0.890   0.893   0.895   0.899   0.904

Note. Under Model 1, the observed score is the mean of one or more repetitions of one or more tokens
of each of 23 syllables for a given talker on a given day. Generalization is over tokens and days, but
not talkers.

For small numbers of tokens and/or repetitions, when 
conditions of testing are held constant, generalizability is 
higher for syllable groups correct. This does not hold for 
large numbers of tokens or repetitions, because generalizability
does not approach one as a limit.¹ By interpolation, six
repetitions of a single token (138 items) would have a 
generalizability coefficient of .800. In principle, if more to- 
kens were available, generalizability of .795 could be 
achieved with a single presentation of four tokens (92 items). 
Thus, acceptable generalizability could be obtained more 
efficiently with multiple tokens than with multiple presenta- 
tions of the same token. 
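The test lengths quoted above can also be obtained by solving the Model 1 coefficient for the number of repetitions rather than by interpolation in the table. The sketch below assumes the same denominator structure as the tabled values (Subject × Day and Subject × Day × Talker undivided, Subject × Token divided by m, residual error divided by mn) and uses the rounded Table 2 estimates; the function name is hypothetical.

```python
import math

# Rounded Table 2 estimates for the components that enter the coefficient.
SYLLABLES_CORRECT = {
    "S": 0.00140, "SxD": 0.0, "SxT": 0.00042,
    "SxTo/T": 0.0, "SxDxT": 0.0, "Error": 0.00640,
}
SYLLABLE_GROUPS = {
    "S": 0.00544, "SxD": 0.00004, "SxT": 0.00051,
    "SxTo/T": 0.00030, "SxDxT": 0.00060, "Error": 0.00330,
}


def repetitions_needed(vc, m, target):
    """Repetitions n per token (possibly fractional) that reach the target.

    Solves target = U / (fixed + Error / (m * n)) for n, where U is the
    universe-score variance and `fixed` collects the terms unaffected by n.
    Returns math.inf when the target exceeds the asymptote for m tokens.
    """
    universe = vc["S"] + vc["SxT"]
    fixed = universe + vc["SxD"] + vc["SxTo/T"] / m + vc["SxDxT"]
    slack = universe / target - fixed  # variance budget left for residual error
    if slack <= 0:
        return math.inf
    return vc["Error"] / (m * slack)


n_correct = repetitions_needed(SYLLABLES_CORRECT, m=1, target=0.800)
n_groups = repetitions_needed(SYLLABLE_GROUPS, m=1, target=0.800)
```

With one token, the solution is about 14 repetitions for syllables correct and about 6 for syllable groups correct, in line with the figures given in the text.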

A modification to Model 1 is to include talker in the 
universe of generalization. With this change in test-score 
interpretation (but not test administration procedures), gen- 
eralizability is 



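$$\rho^2 = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_5^2 + \sigma_6^2 + \dfrac{\sigma_7^2}{m} + \sigma_{10}^2 + \dfrac{\sigma_e^2}{mn}} \tag{3}$$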



Model 2 is the same as Model 1 except that there is no 
variance component for the Subject × Talker interaction in
the numerator of the generalizability coefficient. For a given
testing protocol, generalizability is necessarily lower under
Model 2 than under Model 1, unless the variance component
for talker is zero. 

Generalizability coefficients estimated under Model 2 are 
shown in Table 4. For syllables correct, generalizability is not 
acceptably high even for an infinitely long test. Testing with 
one talker on one day and then generalizing over talkers and 
days, as well as tokens, is simply not feasible. In contrast, 
when performance is measured in terms of syllable groups 
correct, it is possible to achieve acceptable generalizability, 
provided the test is sufficiently long. For example, if there 
were four tokens of each syllable and each were presented 
five times (460 items), it is estimated that generalizability 
would be .797. With the existing two tokens, more than 10
presentations of each would be needed. 
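The ceiling on generalizability under Model 2 can be seen by letting the numbers of tokens and repetitions grow without bound. The sketch below assumes the Model 1 denominator with the Subject × Talker component removed from the numerator, as described in the text, and uses the rounded Table 2 estimates, so results agree with the tabled values only to within rounding.

```python
import math

# Rounded Table 2 estimates for the components that enter the coefficient.
SYLLABLES_CORRECT = {
    "S": 0.00140, "SxD": 0.0, "SxT": 0.00042,
    "SxTo/T": 0.0, "SxDxT": 0.0, "Error": 0.00640,
}
SYLLABLE_GROUPS = {
    "S": 0.00544, "SxD": 0.00004, "SxT": 0.00051,
    "SxTo/T": 0.00030, "SxDxT": 0.00060, "Error": 0.00330,
}


def g_model2(vc, m, n):
    """Generalizability over tokens, days, and talkers (one talker tested)."""
    observed = (
        vc["S"]
        + vc["SxD"]
        + vc["SxT"]              # now counts as error, not universe-score variance
        + vc["SxTo/T"] / m
        + vc["SxDxT"]
        + vc["Error"] / (m * n)
    )
    return vc["S"] / observed


# Asymptotes: the m- and n-dependent terms vanish for an infinitely long test.
ceiling_correct = g_model2(SYLLABLES_CORRECT, math.inf, math.inf)
ceiling_groups = g_model2(SYLLABLE_GROUPS, math.inf, math.inf)
```

The ceiling for syllables correct stays below .800, whereas the ceiling for syllable groups correct exceeds it, which is why only the latter measure can reach acceptable generalizability under this model.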

Because there are nontrivial effects of talker and Subject
× Talker interaction, a better way to achieve generalizability
across the two talkers would be to present syllables from 
both talkers. Under Model 3, the minimum test consists of 
46 items: one token of each of the 23 syllables from each of 
the talkers. The universe of generalization is the same as for 
Model 2, and universe-score variance is the same, but 
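observed-score variance is reduced. Generalizability is

$$\rho^2 = \frac{\sigma^2_{S}}{\sigma^2_{S} + \sigma^2_{S \times D} + \dfrac{\sigma^2_{S \times \mathrm{Token}}}{2m} + \dfrac{\sigma^2_{e}}{2mn}} \tag{4}$$
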



1 The authors thank Alan Cobo-Lewis for pointing this out. 



Estimated generalizability coefficients for Model 3 are pre- 
sented in Table 5. 

For syllables correct, generalizability of .797 could be
obtained using three presentations of three tokens by each
talker (414 items). Other combinations of tokens and repetitions
yield even higher generalizability, but a considerable
investment in testing time is required. For syllable groups
correct, generalizability is quite high under Model 3. A single
presentation of one token from each talker (46 items) yields
generalizability of .748, and presentation of two tokens
raises this to .853.

TABLE 4. Generalizability coefficients for two measures of speechreading performance on CV
syllables as a function of number of tokens and number of repetitions (Model 2).

                                     Tokens
Repetitions      1       2       3       4       5      10       ∞

Syllables correct
     1         .170    .279    .354    .409    .451    .568    .767
     2         .279    .409    .484    .533    .568    .653    .767
     3         .354    .484    .552    .594    .622    .687    .767
     4         .409    .533    .594    .629    .653    .705    .767
     5         .451    .568    .622    .653    .673    .717    .767
    10         .568    .653    .687    .705    .717    .741    .767
     ∞         .767    .767    .767    .767    .767    .767    .767

Syllable groups correct
     1         .534    .648    .698    .726    .744    .783    .826
     2         .637    .719    .752    .769    .780    .802    .826
     3         .681    .746    .771    .784    .792    .809    .826
     4         .705    .761    .781    .792    .798    .812    .826
     5         .721    .770    .787    .797    .802    .814    .826
    10         .754    .788    .800    .806    .810    .818    .826
     ∞         .790    .807    .813    .816    .818    .822    .826

Note. Under Model 2 the observed score is the mean of one or more repetitions of one or more tokens
of each of 23 syllables for a given talker on a given day. Generalization is over tokens, talkers, and
days.

Comparison of the results in Tables 4 and 5 requires that 
the total number of tokens be considered. The comparable 
columns of the two tables are as follows: (a) Model 2, 2 
tokens, versus Model 3, 1 token; (b) Model 2, 4 tokens, 
versus Model 3, 2 tokens; and (c) Model 2, 10 tokens, versus
Model 3, 5 tokens. Each of these comparisons reveals that
generalizability is higher when additional tokens represent 
both talkers rather than a single talker. Thus, with the total 
number of items (and hence testing time) held constant, it is 
best to present multiple tokens of both talkers. 
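
The three comparisons can be laid side by side in a few lines of code. The sketch below simply transcribes the one-repetition entries of Tables 4 and 5 (keyed by the total number of tokens per syllable) and reports the difference; the dictionary and function names are illustrative and do not come from the article.

```python
# Coefficients for one repetition, transcribed from Tables 4 and 5.
# Keys are the total number of tokens per syllable across talkers.
model2_one_talker = {
    "syllables correct": {2: 0.279, 4: 0.409, 10: 0.568},
    "syllable groups correct": {2: 0.648, 4: 0.726, 10: 0.783},
}
model3_two_talkers = {
    "syllables correct": {2: 0.304, 4: 0.467, 10: 0.686},
    "syllable groups correct": {2: 0.748, 4: 0.853, 10: 0.932},
}


def gains(measure):
    """Return {total tokens: Model 3 coefficient minus Model 2 coefficient}."""
    m2 = model2_one_talker[measure]
    m3 = model3_two_talkers[measure]
    return {tokens: round(m3[tokens] - m2[tokens], 3) for tokens in m2}


for measure in model2_one_talker:
    print(measure, gains(measure))
```

Every difference is positive, and the advantage of splitting tokens across talkers grows with the total number of tokens for both measures.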

Models 1-3 contain a variance component for the interaction
of Subject x Day that is not affected by lengthening
the test with additional tokens or repetitions. Fortunately,
this is of little consequence for these CV syllables because
the estimated Subject x Day variance components for
syllables correct and syllable groups correct were zero and
0.00004, respectively. Nevertheless, if n repetitions of m
tokens were presented on each of k days, the variance
components for Subject x Day and for residual error would
be divided by k. In principle, repetition across days therefore
increases generalizability at least as much as repetition
within days, although with Subject x Day components this
small the difference is negligible.

TABLE 5. Generalizability coefficients for two measures of speechreading performance on CV
syllables as a function of number of tokens and number of repetitions (Model 3).

                                     Tokens
Repetitions      1       2       3       4       5      10       ∞

Syllables correct
     1         .304    .467    .568    .636    .686    .814   1.000
     2         .467    .636    .724    .778    .814    .897   1.000
     3         .568    .724    .797    .840    .868    .929   1.000
     4         .636    .778    .840    .875    .897    .946   1.000
     5         .686    .814    .868    .897    .916    .956   1.000
    10         .814    .897    .929    .946    .956    .978   1.000
     ∞        1.000   1.000   1.000   1.000   1.000   1.000   1.000

Syllable groups correct
     1         .748    .853    .895    .918    .932    .962    .994
     2         .843    .912    .938    .951    .959    .976    .994
     3         .881    .934    .953    .963    .969    .981    .994
     4         .901    .945    .961    .969    .974    .984    .994
     5         .913    .952    .965    .972    .976    .985    .994
    10         .940    .966    .975    .980    .982    .988    .994
     ∞         .967    .980    .985    .987    .988    .991    .994

Note. Under Model 3 the observed score is the mean of one or more repetitions of one or more tokens
of each of 23 syllables for each talker on a given day. Generalization is over tokens, talkers, and days.

Words 

Descriptive statistics. Means and standard deviations 
for three measures of performance on MRT words are 
shown in Table 6 as a function of set order, day, and set. 
Overall, only 8.9% of the words were correctly identified, but 
1.22 phonemes per word (40.6%) were correct. Inspection
of the stimulus-response alignments produced by the se- 
quence comparator revealed that there were few phoneme 
deletions (M = 0.44 per word) and that phoneme substitu- 
tion errors were highly plausible. In terms of the visual 
distance metric (where higher visual distance represents 
poorer performance), mean normalized visual distance was 
5.86. A visual distance of this magnitude corresponds to 
individual substitution errors such as 16/ for /s/, which are 
plausible on perceptual grounds. Mean performance was 
virtually identical for the two orders in which the MRT word 
sets were presented. Day 2 performance was better than 
Day 1 performance for the WC and VDNS measures and Set 
1 resulted in more words correct. 

Structural model. The unit of observation was the aver- 
age across the two words from a given ensemble of the 
MRT, administered on a given day. Each subject therefore 
had 100 scores on each measure, 50 for Day 1 and 50 for
Day 2. Set order was considered a fixed, between-subjects 
factor; day and ensemble were considered random, within- 
subjects factors. Because each subject received Set 1 on 
one day and Set 2 on the other, the interaction of Set Order 
x Day is equivalent to the main effect of set. Likewise, the 
interaction of Set Order x Day x Ensemble is equivalent to 
the interaction of Set x Ensemble. Because the assignment 
of word pairs to Sets 1 and 2 was arbitrary, this interaction 
represents differences in difficulty of the word pairs within 
each ensemble. The highest-order interaction, Subject x 
Day x Ensemble, was assumed to be zero. 



Estimation of variance components. The variance ex- 
plained by each effect is shown in Table 7. For each 
measure, the largest source of variance is residual error and 
the largest systematic source is typically ensemble. The 
four-word ensembles of the MRT differ substantially in 
difficulty as stimuli for speechreading, as do the pairs of 
words within each ensemble that were assigned to Set 1 and 
Set 2 (see estimates for the Set x Ensemble interaction). 
Other nontrivial effects are those for subject and the inter- 
action of Subject x Ensemble. One important result is that 
the main effect of day and the Subject x Day interaction are 
not important sources of variability in the test scores 
(<1.0%) in this population. 
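
The percentage columns of Table 7 follow directly from the variance estimates: each component is divided by the sum of all components for that measure. The sketch below does this for words correct, using the nonzero estimates transcribed from Table 7; the function and variable names are illustrative.

```python
# Nonzero variance estimates for words correct (WC), transcribed from Table 7.
wc_components = {
    "S/SO": 0.1756,
    "D": 0.0028,
    "E": 0.4286,
    "Set": 0.0383,
    "Set x E": 0.4896,
    "S/SO x D": 0.0052,
    "S/SO x E": 0.2100,
    "Error": 3.3600,
}


def percent_of_total(components):
    """Express each variance component as a percentage of their sum."""
    total = sum(components.values())
    return {source: round(100 * value / total, 1)
            for source, value in components.items()}


wc_percent = percent_of_total(wc_components)
for source, pct in wc_percent.items():
    print(f"{source:10s} {pct:5.1f}%")
```

The day-related sources (D and S/SO x D) each come out at about 0.1% of the total, which is the basis for the statement that they are not important sources of variability.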

Implications for test development. Because MRT words
were recorded for only one talker, observed scores must be 
considered talker-specific. Also, different word pairs from 
each ensemble were presented on different days, so day- 
to-day variability reflects not only temporal factors associ- 
ated with the test occasion but also random variation arising 
from testing with different words.

The minimum test with these MRT words consists of one 
presentation of a pair of words from one ensemble of a given 
set on a given day. The model for this score has the same 
form as Equation 1, but the parameters represent the 
sources of variance shown in Table 7. Consider a testing 
model in which the observed score is the mean over n 
presentations of m such two-word ensembles on a given 
day. The universe of generalization is across ensembles and 
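days. The formula for the generalizability coefficient is:

$$\rho^2 = \frac{\sigma^2_{2}}{\sigma^2_{2} + \sigma^2_{9} + \dfrac{\sigma^2_{10}}{m} + \dfrac{\sigma^2_{e}}{mn}} \tag{5}$$

where the subscripts refer to the sources of variance in Table 7
(2 = S/SO, 9 = S/SO x D, 10 = S/SO x E, e = error).
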



Generalizability coefficients were estimated for each measure
under this model, using the variance estimates in Table 7
and several values of m and n. Results are shown in
Table 8.

TABLE 6. Descriptive statistics for three measures of speechreading performance on MRT
words as a function of set order, day, and set.

                          WC              PCNS              VDNS
                       M      SD        M      SD        M      SD

All conditions       .089    .046     .406    .081     5.86    1.34

Between-subjects
  Set 1 first        .089    .049     .405    .087     5.82    1.37
  Set 2 first        .089    .044     .407    .076     5.91    1.33

Within-subjects
  Day 1              .085    .050     .402    .085     5.94    1.42
  Day 2              .094    .055     .409    .084     5.79    1.36
  Set 1              .105    .056     .407    .087     5.88    1.42
  Set 2              .074    .044     .404    .082     5.84    1.35

Note. WC = Words correct; PCNS = phonemes correct normalized on stimulus length in phonemes;
VDNS = visual distance normalized on stimulus length in phonemes. For between-subjects statistics,
N = 96; for within-subjects statistics, n = 48.

TABLE 7. Estimated variance components for three measures of speechreading performance
on MRT words.

                           WC                 PCNS                VDNS
Source              Var. est.     %     Var. est.     %     Var. est.     %

 1. SO               0           0       0.0000      0.0      0           0
 2. S/SO             0.1756      3.7     0.0060      9.1      1.69       12.8
 3. D                0.0028      0.1     0.0000      0.0      0.01        0.1
 4. E                0.4286      9.1     0.0129     19.4      2.30       17.4
 5. D x E            0           0       0.0000      0.0      0           0
 6. Set^a            0.0383      0.8     0           0        0           0
 7. SO x E           0           0       0           0        0           0
 8. Set x E^b        0.4896     10.4     0.0076     11.4      0.98        7.4
 9. S/SO x D         0.0052      0.1     0.0004      0.6      0.09        0.7
10. S/SO x E         0.2100      4.5     0.0064      9.6      0.51        3.9
11. Error            3.3600     71.3     0.0330     49.8      7.61       57.7

Note. WC = Words correct; PCNS = phonemes correct normalized on stimulus length in phonemes;
VDNS = visual distance normalized on stimulus length in phonemes. SO = set order, S = subject,
D = day, E = ensemble. Negative estimates of variance have been set to zero.
^a The interaction of Set Order x Day is equivalent to the main effect of set.
^b The interaction of Set Order x Day x Ensemble is equivalent to the interaction of Set x Ensemble.

For words correct, only seven of the tabled values for tests
of finite length exceed the arbitrary minimum of .800. This
level of generalizability could be achieved (ρ² = .803) by
presenting the two words in each of 50 ensembles twice
(200 words). Alternatively, 25 ensembles could be selected,
with each pair of words presented five times (ρ² = .813; 250
words). The trade-off between lengthening the test through
addition of ensembles rather than repetitions is apparent, as
it was for the CV syllables. Because differences among
ensembles tend to be an important source of variability in
the observed scores, adding ensembles and sampling content
more broadly improves generalizability more than simply
repeating a given selection of ensembles.

TABLE 8. Generalizability coefficients for three measures of
speechreading performance on MRT words as a function of
number of two-word ensembles and number of repetitions.

Number of                 Number of ensembles
repetitions       5       10       25       50        ∞

Words correct
     1          .196     .327     .543     .696     .971
     2          .314     .475     .685     .803     .971
     3          .393     .560     .750     .847     .971
     4          .449     .614     .788     .870     .971
     5          .492     .653     .813     .885     .971
    10          .606     .746     .867     .916     .971
     ∞          .807     .893     .954     .977    1.000

Phonemes correct normalized on stimulus length
     1          .422     .583     .755     .838     .940
     2          .549     .693     .823     .878     .940
     3          .610     .740     .848     .892     .940
     4          .646     .766     .862     .899     .940
     5          .669     .782     .870     .904     .940
    10          .722     .817     .887     .913     .940
     ∞          .825     .904     .959     .979    1.000

Visual distance normalized on stimulus length
     1          .497     .652     .803     .870     .950
     2          .640     .764     .866     .906     .950
     3          .708     .811     .889     .918     .950
     4          .747     .836     .901     .925     .950
     5          .773     .852     .908     .929     .950
    10          .831     .886     .923     .936     .950
     ∞          .943     .971     .988     .994    1.000

Note. The observed score is the mean of one or more repetitions of
five or more two-word ensembles presented on a given test day.
Within a given ensemble, different words are presented on different
days. Generalization is over ensembles and days.

The measures of phonemes correct and visual distance 
both result in acceptable generalizability coefficients for 
relatively brief tests. Twenty-five ensembles, each presented 
twice (100 words), are predicted to yield a generalizability 
coefficient of .823 for normalized phonemes correct. One 
presentation of 25 ensembles (50 words) is expected to 
produce generalizability of .803 for visual distance. Clearly, 
there are meaningful individual differences among subjects 
in their ability to speechread these monosyllabic words and 
these differences are more apparent in their phoneme and 
visual distance scores than in their words-correct scores. 
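
The tabled coefficients can be checked against the variance estimates with a few lines of code. The sketch below assumes the single-day model described above: subject variance in the numerator, and subject, Subject x Day, Subject x Ensemble (divided by m), and residual error (divided by mn) in the denominator. The function and variable names are illustrative, and the inputs are the rounded estimates from Table 7, so agreement with Table 8 is to about three decimals at best.

```python
def g_coefficient(var_subject, var_subject_day, var_subject_ensemble,
                  var_error, m, n):
    """Generalizability of the mean of n presentations of m two-word
    ensembles on a single day (generalizing over ensembles and days)."""
    observed = (var_subject + var_subject_day
                + var_subject_ensemble / m
                + var_error / (m * n))
    return var_subject / observed


# Table 7 estimates, in the order: S/SO, S/SO x D, S/SO x E, Error.
WC = (0.1756, 0.0052, 0.2100, 3.3600)
VDNS = (1.69, 0.09, 0.51, 7.61)

print(round(g_coefficient(*WC, m=50, n=2), 3))    # 0.803 (Table 8: .803)
print(round(g_coefficient(*WC, m=25, n=5), 3))    # 0.813 (Table 8: .813)
print(round(g_coefficient(*VDNS, m=25, n=1), 3))  # 0.803 (Table 8: .803)
```

With the rounded phonemes-correct estimates the same function gives about .820 for 25 ensembles presented twice, rather than the tabled .823; the discrepancy is what one would expect from variance components reported to only two significant digits.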

As Equation 5 shows, the variance component for the 
interaction of Subject x Day is not affected by adding 
ensembles or additional repetitions. However, because the 
magnitude of this variance component is quite small, there is 
little to be gained by testing on more than one day. Never- 
theless, if k represents the number of test days, the pre- 
dicted generalizability coefficients can be derived from 
Equation 5 by dividing the variance components for Subject 
x Day and residual error by k. 

CID Sentences 

Descriptive statistics. Means and standard deviations 
for the six performance measures are shown in Table 9 as a 
function of set order, day, talker, and set. Before normaliza- 
tion, an average of 1.86 words and 6.16 phonemes per 
sentence was correct. Normalized visual distance (VDNS) 
was slightly higher than that obtained with MRT words (6.61 
vs. 5.86).

TABLE 9. Descriptive statistics for three measures of speechreading performance on CID
Sentences as a function of set order, day, talker, and set.

                         WCNS              PCNS              VDNS
                       M      SD        M      SD        M      SD

All conditions       .284    .130     .335    .132     6.61    1.37

Between subjects
  Set 1 1st          .278    .131     .330    .136     6.68    1.35
  Set 2 1st          .290    .130     .340    .129     6.54    1.40

Within subjects
  Day 1              .274    .130     .326    .132     6.68    1.34
  Day 2              .294    .138     .344    .139     6.54    1.44
  Talker 1           .355    .145     .398    .145     6.15    1.54
    Set 1            .312    .151     .365    .152     6.35    1.60
    Set 2            .398    .149     .431    .148     5.94    1.59
  Talker 2           .213    .120     .272    .125     7.07    1.27
    Set 1            .214    .131     .273    .137     6.99    1.38
    Set 2            .213    .118     .271    .120     7.15    1.23

Note. WCNS = Words correct normalized on stimulus length in words; PCNS = phonemes correct
normalized on stimulus length in phonemes; VDNS = visual distance normalized on stimulus length in
phonemes. For between-subjects statistics, N = 96; for within-subjects statistics, n = 48.



Differences between Set 1 and Set 2 were quite small, as 
were differences between performance on Day 1 and Day 2. 
Although an attempt had been made to equalize the difficulty 
of the two talkers, the means in Table 9 suggest that the 
assignment of sentences to talkers was unduly biased in 
favor of Talker 1, resulting in an apparent reversal of the
relative difficulty of the two talkers (cf. Demorest & Bernstein,
1992). In addition, the two sets of sentences for Talker 1 
appear to differ more than the two sets for Talker 2. 

Structural model. The unit of analysis for all measures 
was the score on a single test item (i.e., a single sentence). 
Each subject thus had 100 scores on each measure: 25 
sentences (items) in each of two sets for each talker. For 
each talker, one set of sentences was arbitrarily designated 
Set 1 and the other Set 2. Set order was considered a fixed, 
between-subjects factor. Within-subjects factors were 
talker, set within talker, and item within set within talker. 
Talker was considered a fixed factor; set and item were 
considered random. Because each subject received Set 1 
on one day and Set 2 on the other, the interaction of Set 
Order x Set is equivalent to the main effect of day. Likewise, 
the interaction Set Order x Item is equivalent to Day x Item.
The highest-order interaction (Subject x Item) was assumed 
to be zero. 

Estimation of variance components. The estimated 
variance attributable to each effect in the structural model is 
shown in Table 10. Although residual variance accounts for 
the largest percentage of total variance for most measures, 
there are also large effects associated with subjects and 
items (i.e., sentences). The only other nontrivial effect is that 
for talker. 

Implications for test development. Considerations that 
often arise in testing speechreading with sentences include 
the number of test items, the talker, and the effects of testing 
on more than one day. If the minimum test of speechreading 



is considered a single sentence of a given talker adminis- 
tered on a given day, the effects on the observed test score 
can be modeled by Equation 1, with effect parameters 
defined in terms of the sources of variance in Table 10. 

Generalizability coefficients have been derived, as a func- 
tion of test length, for five testing models. Formulas for the 
generalizability coefficients are given in Table 11. Under 
Model 1, m sentences from a single set of a single talker are
administered on a given day and the observed score is the
mean across the m sentences. Set order (and hence day) is
the same for all subjects. The universe of generalization is
across items and sets, but not talkers. Because
different sets are administered on different days, generaliza- 
tion across sets includes not only differences attributable to 
the sets themselves, but also the effects of testing on 
different days. Note that set is a random effect, so this model 
does not constrain the sets to be defined as they have been 
in the present experiment. In the numerator of the general- 
izability coefficient, universe-score variance contains com- 
ponents for subjects and the interaction of Subject x Talker. 
In the denominator, observed-score variance contains these 
effects plus the interaction of Subject x Set and residual 
error. 

If generalization across talkers is also desired, the Subject 
x Talker interaction disappears, and generalizability is re- 
duced (see Model 2). Given that sets were nested within 
talkers, generalization across talkers also involves generali- 
zation across the different sets of sentences assigned to the 
two talkers. Set order is constant in Models 1 and 2, so 
subjects are always tested with different sets in a fixed order. 
In some situations, however, sets are administered in different
orders to different subjects. To accommodate this approach
to data collection, generalizability coefficients were
derived for a protocol that is the same as for Models 1 and
2, except that set order varies among subjects (Models 3
and 4).

TABLE 10. Estimated variance components for three measures of speechreading performance
on CID Sentences.

                            WCNS                PCNS                VDNS
Source                Var. est.     %     Var. est.     %     Var. est.     %

 1. SO                 0           0       0           0        0           0
 2. S/SO               0.0163     13.8     0.0170     16.0      1.82       15.6
 3. T                  0.0046      3.9     0.0037      3.5      0.20        1.7
 4. Set/T              0.0008      0.6     0.0001      0.1      0           0
 5. I/Set/T            0.0260     22.0     0.0248     23.4      2.06       17.7
 6. SO x T             0           0       0           0        0           0
 7. Day^a              0.0002      0.1     0.0002      0.1      0.01        0.1
 8. Day x I/Set/T^b    0           0       0.0000      0.0      0.00        0.0
 9. S/SO x T           0.0004      0.3     0.0003      0.3      0.07        0.6
10. S/SO x Day^c       0           0       0.0000      0.0      0           0
11. Error              0.0700     59.2     0.0600     56.6      7.47       64.3

Note. WCNS = words correct normalized on stimulus length in words; PCNS = phonemes correct
normalized on stimulus length in phonemes; VDNS = visual distance normalized on stimulus length in
phonemes. SO = set order, S = subject, T = talker, I = item. Negative estimates of variance have
been set to zero.
^a The interaction of SO x Set/T is equivalent to the main effect of day.
^b The interaction of SO x Item/Set/T is equivalent to the interaction of Day x Item/Set/Talker.
^c The interaction of S/SO x Set/T is equivalent to the interaction of S/SO x Day.

Comparison of formulas for Models 3 and 4 with those for
Models 1 and 2 reveals additional components in the
numerator for the main effect of set order and the interaction
of Set Order x Talker. These components reflect the confounding
of set order with the individual differences among
subjects. However, because the magnitude of both effects
was estimated to be zero for all measures, the confounding
has no consequences for test score interpretation. An additional
component in the denominator for Models 3 and 4 is
the interaction of Set Order x Item. Given the small estimated
magnitude of this effect and the fact that it is divided
by m, its impact on the generalizability coefficient is quite
small.

TABLE 11. Generalizability coefficients (ρ²) for five models of
testing speechreading with CID Sentences.

Testing
model     Generalizability coefficient

  1       $\rho^2 = \dfrac{\sigma^2_{2} + \sigma^2_{9}}{\sigma^2_{2} + \sigma^2_{9} + \sigma^2_{10} + \sigma^2_{e}/m}$

  2       $\rho^2 = \dfrac{\sigma^2_{2}}{\sigma^2_{2} + \sigma^2_{9} + \sigma^2_{10} + \sigma^2_{e}/m}$

  3       $\rho^2 = \dfrac{\sigma^2_{1} + \sigma^2_{2} + \sigma^2_{6} + \sigma^2_{9}}{\sigma^2_{1} + \sigma^2_{2} + \sigma^2_{6} + \sigma^2_{7} + \sigma^2_{8}/m + \sigma^2_{9} + \sigma^2_{10} + \sigma^2_{e}/m}$

  4       $\rho^2 = \dfrac{\sigma^2_{1} + \sigma^2_{2} + \sigma^2_{6}}{\sigma^2_{1} + \sigma^2_{2} + \sigma^2_{6} + \sigma^2_{7} + \sigma^2_{8}/m + \sigma^2_{9} + \sigma^2_{10} + \sigma^2_{e}/m}$

  5       $\rho^2 = \dfrac{\sigma^2_{1} + \sigma^2_{2}}{\sigma^2_{1} + \sigma^2_{2} + \sigma^2_{7} + \sigma^2_{8}/m + \sigma^2_{10} + \sigma^2_{e}/m}$

Note. Subscripts on variance components represent sources of
variance: 1 = set order (SO); 2 = subject/set order (S/SO); 6 =
SO x T; 7 = SO x Set/T; 8 = SO x Item/Set/T; 9 = S/SO x T;
10 = S/SO x Set/T; e = error. In denominators, m = number of
items.

Testing with a single talker results in either a talker- 
specific universe score (Models 1 and 3) or reduced gener- 
alizability (Models 2 and 4). An alternative is to test with m 
items, half of which are taken from a single set of Talker 1 
and half of which are taken from a single set of Talker 2. 
Model 5 represents this protocol, with set order allowed to 
vary among subjects. The consequence of testing with both 
talkers is that variance components for the interaction of Set 
Order x Talker and for the interaction of Subject x Talker 
are no longer relevant. Given the negligible magnitude of 
these effects, the effects on generalizability can be expected 
to be small. 

Generalizability coefficients for Models 1-5 were estimated
from the variance components in Table 10 for each
measure and for values of m = 10, 20, 30, 40, 50, 100, and
infinity. The results are presented in Table 12. Regardless of
model or performance measure, estimated generalizability is
≥ .800 for m ≥ 20. (For m = 40, the estimates approach or
exceed .900.) These results replicate the findings of Demorest
and Bernstein (1992), who evaluated five similar models
for testing with the same recordings of the CID sentences.
However, in the present study, testing was conducted on 2
days, thereby permitting estimation of temporal sources of
variability in test scores as well. The similarity of the studies'
results is attributable to the negligible variance accounted
for by the effects of day and Day x Item. These findings
support the conclusions that excellent generalizability can
be obtained with these CID sentences by testing with as few
as 40 sentences and that it is not very important whether a
particular set of sentences is administered on Day 1 or Day
2 to a given subject. It is also feasible to generalize across
talkers even if testing is conducted with only a single talker
(Model 4 vs. Model 5).

TABLE 12. Generalizability coefficients for three measures of speechreading performance on
CID Sentences as a function of number of sentences and five models of testing.

Number of                    Testing model
sentences        1        2        3        4        5

Normalized words correct
    10         .705     .688     .700     .683     .694
    20         .827     .807     .820     .800     .816
    30         .877     .856     .870     .849     .867
    40         .905     .883     .897     .875     .895
    50         .923     .901     .914     .892     .912
   100         .960     .937     .950     .928     .949
     ∞        1.000     .976     .990     .966     .990

Normalized phonemes correct
    10         .742     .730     .738     .725     .734
    20         .852     .837     .846     .831     .844
    30         .896     .881     .889     .874     .888
    40         .920     .904     .913     .897     .912
    50         .935     .919     .928     .912     .926
   100         .966     .950     .958     .942     .958
     ∞        1.000     .983     .991     .974     .991

Visual distance normalized on stimulus length
    10         .716     .691     .714     .689     .706
    20         .835     .805     .831     .802     .826
    30         .883     .852     .880     .849     .876
    40         .910     .878     .906     .874     .903
    50         .927     .894     .923     .890     .920
   100         .962     .928     .958     .924     .956
     ∞        1.000     .965     .996     .961     .995

Note. Model 1: Test with a single talker, set, and set order; generalize over sets (days). Model 2: Same
as Model 1, but also generalize over talkers. Model 3: Test with a single talker and set, but both set
orders; generalize over sets (days) and set orders. Model 4: Same as Model 3, but also generalize over
talkers. Model 5: Test with half the items by one talker and half by the other; use a single set from each
talker; use both set orders; generalize over sets (days), set orders, and talkers.
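
The relation between the variance components and the five testing models can be sketched in code. The function below encodes the model formulas as they are read here from the text and the note to Table 11 (which components enter the numerator and denominator of each model); it is an illustration under that reading, not the authors' own computation, and all names are hypothetical. The inputs are the rounded words-correct (WCNS) estimates from Table 10, so the results match Table 12 only to within about .002.

```python
def g_coefficients(v, m):
    """Return {model number: generalizability coefficient} for m sentences.

    v maps the subscripts used in the note to Table 11 to variance
    estimates: 1 = SO, 2 = S/SO, 6 = SO x T, 7 = SO x Set/T (day),
    8 = SO x Item/Set/T, 9 = S/SO x T, 10 = S/SO x Set/T, "e" = error.
    """
    # Observed-score variance when set order is the same for all subjects.
    fixed_order = v[2] + v[9] + v[10] + v["e"] / m
    # Observed-score variance when set order varies among subjects.
    varied_order = (v[1] + v[2] + v[6] + v[7] + v[8] / m
                    + v[9] + v[10] + v["e"] / m)
    # Testing with both talkers drops the talker interactions (6 and 9).
    both_talkers = v[1] + v[2] + v[7] + v[8] / m + v[10] + v["e"] / m
    return {
        1: (v[2] + v[9]) / fixed_order,
        2: v[2] / fixed_order,
        3: (v[1] + v[2] + v[6] + v[9]) / varied_order,
        4: (v[1] + v[2] + v[6]) / varied_order,
        5: (v[1] + v[2]) / both_talkers,
    }


# WCNS estimates from Table 10 (zero entries included for completeness).
wcns = {1: 0.0, 2: 0.0163, 6: 0.0, 7: 0.0002, 8: 0.0,
        9: 0.0004, 10: 0.0, "e": 0.0700}

for m in (10, 20, 40):
    rounded = {model: round(rho2, 3)
               for model, rho2 in g_coefficients(wcns, m).items()}
    print(m, rounded)
```

Because the Subject x Talker component is so small relative to the subject component, the coefficients for Models 2 and 4 fall only slightly below those for Models 1 and 3, which is the pattern visible across the columns of Table 12.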

The present study also extended the results of Demorest 
and Bernstein (1992) by including new measures of perfor- 
mance. Comparison of the panels of Table 12 reveals little 
difference among the performance measures: Each provides 
a conceptually distinct and psychometrically sound basis for 
measuring individual differences in speechreading sen- 
tences among subjects with normal hearing. This is partic- 
ularly interesting because normalized words correct and 
normalized phonemes correct are measures of correct per- 
formance, whereas normalized visual distance quantifies not 
only the number, but also the magnitude, of subjects' errors. 
That is, subjects who make a large number of errors make 
more serious errors and subjects who make few errors make 
less serious ones. 

One issue not addressed by Models 1-5 is the effect of 
defining the observed score as a mean score across more 
than one day of testing. If m items are administered on each 
of k days under Models 1-5, the generalizability coefficients 
shown in Table 12 are affected as follows: First, variance 
components for day and Subject x Set are divided by k. 
Because these effects are quite small, there is little gained by 
testing on more than one day. Second, variance compo- 
nents that are already divided by m are also divided by k. 



This reflects the lengthening of the test that occurs by using 
mk items instead of m. 

B-E Sentences 

Data for the B-E Sentences were obtained under the same 
experimental conditions as the CID sentences. Hence, the 
same structural model was used for estimating variance 
components, and the same psychometric models were used 
for examination of generalizability. 

Descriptive statistics. Means and standard deviations 
are shown in Table 13 as a function of set order, day, talker, 
and set. An average of 1.20 words and 4.67 phonemes per
sentence was correct. Overall, the B-E Sentences appear to
be more difficult to speechread than the CID sentences (cf.
Table 9). The possibility that this was due to a greater
number of nonresponses to the B-E Sentences was examined,
but this does not appear to be the case (CID, 14.5%
nonresponse; B-E, 15.9%). Across all six measures, means
for the two set orders and the 2 days of testing were virtually
identical. The talker differences suggested in Table 13 are not
so large as those seen in Table 9, but, more importantly,
the previously observed difference favoring Talker 2 (Demorest
& Bernstein, 1992) appears in these data. Finally, there
is evidence that the two sets of sentences for each talker 
differed somewhat in overall difficulty.

TABLE 13. Descriptive statistics for three measures of speechreading performance on
Bernstein-Eberhardt Sentences as a function of set order, day, talker, and set.

                         WCNS              PCNS              VDNS
                       M      SD        M      SD        M      SD

All conditions       .205  [illegible] .265    .096     7.38    1.06

Between subjects
  Set 1 1st          .202    .089     .262    .104     7.43    1.05
  Set 2 1st          .207    .081     .268    .088     7.33    1.07

Within subjects
  Day 1              .204    .098     .264    .096     7.40    1.06
  Day 2              .205    .096     .266    .101     7.36    1.10
  Talker 1           .181    .080     .235    .092     7.56    0.99
    Set 1            .156    .066     .209    .077     7.85    0.89
    Set 2            .205    .104     .261    .115     7.27    1.23
  Talker 2           .228    .097     .295    .108     7.20    1.18
    Set 1            .279    .112     .331    .123     6.85    1.35
    Set 2            .178    .093     .258    .104     7.54    1.10

Note. WCNS = Words correct normalized on stimulus length in words; PCNS = phonemes correct
normalized on stimulus length in phonemes; VDNS = visual distance normalized on stimulus length in
phonemes. For between-subjects statistics, N = 96; within-subjects statistics, n = 48.



Estimation of variance components. The estimated 
variance components (see Table 14) are similar to those in 
Table 10, in that error variance is the largest component for
all measures and the largest systematic sources of variance
are subjects and items. For most effects, the magnitude of
the variance components is smaller for B-E Sentences, and
the percentage of variance due to subjects is smaller for all
measures. As a consequence, other things being equal,
generalizability will be lower for the B-E Sentences. 

Implications for test development. Generalizability co- 
efficients for Models 1-5 (see Table 11) are presented in 
Table 15. As expected, the values are consistently lower 
than those in Table 12 for the CID sentences. To achieve a 



specified level of generalizability, it would be necessary to 
test with a greater number of B-E sentences than CID 
sentences. For example, depending on the model and the 
performance measure, 30 to 50 sentences would be needed 
to obtain generalizability of .800. 

TABLE 14. Estimated variance components for three measures of speechreading performance
on Bernstein-Eberhardt Sentences.

                            WCNS                PCNS                VDNS
Source                 Var. est.     %     Var. est.     %     Var. est.     %

 1. SO                  0           0       0           0       0           0
 2. S/SO                0.0066      9.3     0.0087     12.1     1.05       14.3
 3. T                   0           0       0.0004      0.5     0           0
 4. Set/T               0.0023      3.2     0.0012      1.6     0.15        2.0
 5. I/Set/T             0.0211     29.8     0.0206     28.7     1.27       17.4
 6. SO x T              0.0000      0.0     0           0       0           0
 7. Day^a               0           0       0           0       0           0
 8. Day x I/Set/T^b     0           0       0           0       0           0
 9. S/SO x T            0.0002      0.3     0.0002      0.3     0           0
10. S/SO x Day^c        0.0008      1.1     0.0008      0.1     0.11        1.4
11. Error               0.0400     56.3     0.0400     55.7     4.76       64.9

Note. WCNS = words correct normalized on stimulus length in words; PCNS = phonemes correct
normalized on stimulus length in phonemes; VDNS = visual distance normalized on stimulus length in
phonemes. SO = Set order, S = subject, T = talker, I = item. Negative estimates of variance have
been set to zero.

^a The interaction of SO x Set/T is equivalent to the main effect of day.

^b The interaction of SO x Item/Set/T is equivalent to the interaction of Day x Item/Set/Talker.

^c The interaction of S/SO x Set/T is equivalent to the interaction of S/SO x Day.

TABLE 15. Generalizability coefficients for three measures of speechreading performance on
Bernstein-Eberhardt Sentences as a function of number of sentences and five models of
testing.

Number of                  Testing model
sentences       1       2       3       4       5

Normalized words correct
   10         .586    .569    .586    .569    .579
   20         .708    .688    .708    .688    .702
   30         .761    .739    .761    .739    .756
   40         .791    .767    .791    .768    .786
   50         .810    .786    .810    .786    .805
  100         .850    .825    .850    .825    .846
    ∞         .895    .868    .895    .868    .892

Normalized phonemes correct
   10         .650    .635    .650    .635    .644
   20         .761    .744    .761    .744    .757
   30         .807    .789    .807    .789    .803
   40         .832    .813    .832    .813    .829
   50         .848    .829    .848    .829    .845
  100         .881    .861    .881    .861    .879
    ∞         .918    .897    .918    .897    .916

Visual distance normalized on stimulus length
   10         .644    .644    .644    .644    .644
   20         .754    .754    .754    .754    .754
   30         .799    .799    .799    .799    .799
   40         .824    .824    .824    .824    .824
   50         .840    .840    .840    .840    .840
  100         .873    .873    .873    .873    .873
    ∞         .909    .909    .909    .909    .909

Note. Model 1: Test with a single talker, set, and set order; generalize over sets (days). Model 2: Same
as Model 1, but also generalize over talkers. Model 3: Test with a single talker and set, but both set
orders; generalize over sets (days) and set orders. Model 4: Same as Model 3, but also generalize over
talkers. Model 5: Test with half the items by one talker and half by the other; use a single set from each
talker; use both set orders; generalize over sets (days), set orders, and talkers.

Generalization across the two talkers is supported by the
results in Table 15, even when testing is carried out with only
a single talker (Models 1-4). The slight reduction in the
estimated coefficients (Model 1 vs. Model 2 and Model 3 vs.
Model 4) should not be of concern in practical testing
situations. Moreover, testing with both talkers does not
result in appreciably higher generalizability. As with the CID
Sentences, it is possible to estimate generalizability coefficients
for testing with m items on each of k days. However,
the variance components for day and Subject x Day interaction
are not large enough to improve generalizability very
much when they are divided by k.
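
For reference, the dependence on m and k can be written in a standard generalizability-theory form. This is a sketch under our own notation, not a formula quoted from the analysis: the symbol for universe-score variance stands for whichever components the chosen model treats as universe score, only the Subject x Day and residual terms are shown as error (as for relative comparisons among subjects), and each day is assumed to use a different set of m items.

```latex
E\rho^2(m, k) =
  \frac{\sigma^2_{\tau}}
       {\sigma^2_{\tau}
        + \dfrac{\sigma^2_{S \times \mathrm{Day}}}{k}
        + \dfrac{\sigma^2_{e}}{mk}}
```

Only the terms divided by k or mk shrink as days are added, so the gain from testing on additional days is bounded by the size of the Subject x Day component relative to universe-score variance.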



Correlations Among Performance Measures 
and Materials 

Each subject's mean normalized score for each type of 
material was calculated for each performance measure. 
Means for syllables and sentences were taken across both 
talkers. The resulting scores were then correlated (N = 96). 
Results are shown in Table 16, which is divided into sections
separating correlations based on the same versus indepen- 
dent observations. Negative correlations with visual dis- 
tance measures reflect the direction of scoring: Large visual 
distance implies poorer performance. 

The two CV syllable measures correlate only .70. This 
suggests that the strict and lenient scoring methods do not 
rank subjects in the same way. Moreover, for this and all the 
within-materials correlations, the correlation is inflated by 
the part-whole nature of the associations. That is, when the 
syllable per se is correct, the syllable group necessarily is 
also. 



The two CV syllable measures consistently correlate 
about .50 with the word and sentence measures. The 
exception is the correlation between CV syllable groups 
correct and the visual distance measure for MRT words. This 
result reflects the nature of the errors subjects make: Re- 
gardless of the materials, when response phonemes are 
visually similar to stimulus phonemes, syllable group is likely 
to be correct, and the visual distance is small. Even for 
sentences, CV syllable group correlates most highly with 
normalized visual distance. 

The three measures of performance on MRT words are 
highly mutually intercorrelated and correlate almost as 
highly with performance measures on the CID and B-E 
Sentences. When performance measures are the same, the 
correlations range from .81 to .84. An important practical 
implication of these results is the possibility of testing 
speechreading with 200 MRT words for a single talker and 
predicting performance on sentence materials, for both 
talkers, with an acceptable margin of error. 
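
The size of the prediction error implied by a given correlation can be gauged with a short calculation. The sketch below assumes simple linear regression of one score on another, in which the proportion of criterion variance explained is r squared and the residual standard deviation, as a fraction of the criterion standard deviation, is the square root of 1 minus r squared. The function names are ours, and whether the resulting margin is acceptable remains a substantive judgment.

```python
import math


def variance_explained(r):
    """Proportion of criterion variance accounted for by the predictor."""
    return r * r


def residual_sd_ratio(r):
    """Residual SD of prediction as a fraction of the criterion SD."""
    return math.sqrt(1.0 - r * r)


# Range of same-measure correlations between MRT words and sentences (Table 16).
for r in (0.81, 0.84):
    print(r, round(variance_explained(r), 3), round(residual_sd_ratio(r), 3))
```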

Within the CID and B-E Sentence sets, the four measures 
of correct performance correlate almost perfectly. This is 
partly an artifact of the way the measures are defined and 
the fact that they are based on the same response. How- 
ever, the correlations among the CID and B-E measures, 
which are based on different sentences, are also quite high.
When performance is measured in the same way, the
correlations are all .93 or .94.

TABLE 16. Correlations among measures of speechreading performance on nonsense syllables,
MRT words, CID Sentences, and Bernstein-Eberhardt Sentences.

              CV      MRT words               CID Everyday Sentences  B-E Sentences
              Group   WC      PCNS    VDNS    WCNS    PCNS    VDNS    WCNS    PCNS    VDNS

CV Syllable   .70     .50     .47    -.53     .54     .52    -.54     .51     .46    -.51
CV Group              .57     .56    -.72     .55     .52    -.59     .52     .47    -.56
MRT WC                        .86    -.87     .84     .82    -.84     .83     .78    -.79
MRT PCNS                             -.83     .80     .81    -.75     .83     .82    -.73
MRT VDNS                                     -.75    -.71     .81    -.73    -.67     .81
CID WCNS                                              .99    -.89     .94     .89    -.76
CID PCNS                                                     -.85     .95     .93    -.71
CID VDNS                                                             -.80    -.71     .94
B-E WCNS                                                                      .98    -.73
B-E PCNS                                                                             -.64

Note. CV = consonant-vowel; MRT = Modified Rhyme Test; CID = CID Everyday Sentences; B-E =
Bernstein-Eberhardt Sentences. WC = words correct; WCNS = words correct normalized on stimulus
length in words; PCNS = phonemes correct normalized on stimulus length in phonemes; VDNS =
visual distance normalized on stimulus length in phonemes. All correlations are statistically significant,
p < .01.

General Discussion 

Sources of Variability 

The generalizability analyses presented here demonstrate 
that there are meaningful individual differences in speech- 
reading performance among subjects with normal hearing 
on each of the three types of materials investigated. Sub- 
jects and test items are the most consistent and important 
systematic sources of variability in test scores. Talker effects 
are comparatively small for most measures, but there are 
also significant interactions of subject and talker. When test 
scores are interpreted as talker-specific, this interaction 
(provided it is nonzero) contributes to universe-score vari- 
ance and necessarily raises generalizability. When the uni- 
verse-score represents generalization across talkers, the 
interaction is only a source of error variance and hence 
reduces generalizability. 

This study's inclusion of day of testing as a source of 
variability in measures of speechreading performance ex- 
tends the findings of Demorest and Bernstein (1992) to 
permit estimation of generalizability over days, even for 
models in which all testing is conducted on a single day. 
Although there were modest improvements in mean perfor- 
mance over the 2 days of testing, interaction effects involv- 
ing days were generally quite small for all measures. Thus, 
individual differences among subjects were maintained and 
generalizability coefficients were not greatly affected. This 
result does not, however, preclude statistically significant 
practice effects and/or changes in subjects' relative perfor- 
mance within the first day of testing. For example, Rosen 
and Corcoran (1982) presented 21 16-item lists of Bamford- 
Kowal-Bench sentences (Bench & Bamford, 1979) and 
found that scores from the first three lists were less highly
correlated with overall performance than were subsequent
lists. That is, individual differences appeared to stabilize after 
about 50 sentences of practice. In the present study, sub- 
jects speechread 50 sentences from each talker on each 
day, along with 100 MRT words (for the male talker) and 92 
CV syllables from each talker. This raises the possibility that 
the high generalizability coefficients obtained here are, in 
part, dependent on the subjects' additional experience with 
these talkers, and that downward extrapolation to shorter 
lists of sentences might therefore overestimate generaliz- 
ability. A direct comparison between Rosen and Corcoran's 
correlations (their Fig. 3, p. 251) and results for the CID 
sentences under Model 3 (our Table 12) suggests that this is 
not the case. Interpolation for a 16-sentence list results in 
estimated generalizability of .77, which agrees almost per- 
fectly with Rosen and Corcoran's value for their first sen- 
tence list. More importantly, results from both studies sug- 
gest that practice with the talker, whether treated as practice 
only, or scored as part of a 40- to 50-item test, is important 
for reliable assessment of individual differences. 

The results presented in Tables 2, 7, 10, and 14 show that 
residual error is often the largest single source of variability. 
However, this is a consequence of using a minimal test as 
the frame of reference, that is, one token of each nonsense 
syllable, a single pair of words, or a single sentence. Test 
construction proceeds from this base and, as the general- 
izability formulas show, the impact of residual error and 
other sources of observed-score variance can be greatly 
attenuated by increasing test length. 

Performance Measures 

Responses to each type of material in this study were 
scored in terms of two or more performance metrics. Esti- 
mated generalizability coefficients differed somewhat from 
one measure to another, but it is clear that acceptable 
generalizability can be achieved with each measure, provided
test length is adequate and test-score interpretation is
talker-specific.

Although it is tempting to select the performance measure 
for which generalizability is highest (and for which a shorter 
test is therefore permissible), such a strategy is ill advised. 
High generalizability is not an end in itself; it is a conse- 
quence of adopting, a priori, a particular testing protocol, 
performance measure, and universe of generalization. Each 
of these factors must be justified in terms of substantive 
considerations. With the construct thus defined, data on 
generalizability can then be used to develop a psychomet- 
rically sound testing protocol. 

The most striking general feature of the correlations 
among performance measures, taken as a whole, is the 
pattern of higher associations between MRT word measures 
and either CID or B-E Sentence measures than between CV 
syllable measures and either word or sentence measures. 
Utley (1946b) reported a comparable correlation between 
words and sentences and this pattern of correlations is 
completely consistent with contemporary research in word 
recognition (e.g., Duffy & Pisoni, 1992; Marslen-Wilson & 
Warren, 1994; Stanovich, 1980), which suggests that pho- 
neme recognition is not a stage in word recognition and that 
fluent word recognition is key to sentence processing. An 
implication of this result is that future studies of speech- 
reading need to vigorously pursue questions concerning 
word recognition and ignore previous statements in the 
literature suggesting that little is to be learned with this level 
of materials (for examples of this position, see Plant & 
MacRae, 1981, and Rosen & Corcoran, 1982). 



List Equivalence 

A caveat regarding interpretation of generalizability coef- 
ficients concerns the equivalence of alternate forms of a 
test. Two lists of words or sentences may rank individuals 
very similarly, yet mean performance on the two lists may be 
very different. Generalizability coefficients, like classical reli- 
ability coefficients, only reflect the degree to which two lists 
are expected to correlate. If it is important for tests to have 
equal means, such as for testing under different experimen- 
tal conditions, or for testing before and after some type of 
training, normative data are required. If norms are available 
for individual items, items can be blocked on difficulty and 
then randomly assigned to test lists during test construction. 
If norms are only available for whole test lists, and the lists 
do not have equal means, statistical adjustments (such as 
subtraction of the mean or conversion to Z scores) can be 
used to equate the lists. 
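
The two adjustments mentioned above can be sketched in a few lines. All values here are hypothetical and chosen only for illustration: the list norms, the two scores, and the function names are not taken from this study. The example shows a case in which raw scores on two unequal lists suggest a small decline, whereas scores expressed relative to each list's norms suggest improvement.

```python
def equate_by_mean(score, list_mean, reference_mean):
    """Shift a score so that its list mean matches a reference list mean."""
    return score - list_mean + reference_mean


def to_z(score, list_mean, list_sd):
    """Express a score in standard-deviation units of its own list norms."""
    return (score - list_mean) / list_sd


# Hypothetical norms (proportion of words correct) for two alternate lists.
norms = {
    "list_a": {"mean": 0.30, "sd": 0.10},
    "list_b": {"mean": 0.20, "sd": 0.08},
}

pre = 0.25   # hypothetical score on list A before training
post = 0.24  # hypothetical score on list B after training

z_pre = to_z(pre, norms["list_a"]["mean"], norms["list_a"]["sd"])
z_post = to_z(post, norms["list_b"]["mean"], norms["list_b"]["sd"])
post_on_a_scale = equate_by_mean(post, norms["list_b"]["mean"], norms["list_a"]["mean"])

print(z_pre, z_post, post_on_a_scale)
```

Subtracting the mean equates only the list means; conversion to z scores also equates their variability, which matters when the lists differ in spread as well as in difficulty.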

Estimation of Measurement Error 

In some situations, precise estimation of a subject's 
universe score is more important than his or her rank in a 
population. For example, in evaluating devices that supple- 
ment speechreading, it is important to know whether aided 
versus unaided scores are significantly different or whether 
apparent changes in scores over time are greater than would 
be expected on the basis of day-to-day variability in performance.
In classical true-score theory, the standard error of
measurement is used to set a confidence interval around a 
single score or to determine the standard error of the 
difference between two scores (Allen & Yen, 1979). An 
analogous estimate of measurement error for an observed 
score can be constructed using a generalizability coefficient 
in place of the reliability coefficient and using an estimate of 
observed-score variance based on the variance compo- 
nents in the denominator of the generalizability coefficient. 
The obtained standard error of measurement can then be 
used to form a confidence interval around an observed 
score. To form a confidence interval for the difference 
between two observed scores, the standard error must be 
increased by a factor of √2.
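
A rough numerical sketch of this procedure follows. The inputs are illustrative rather than definitive: the generalizability coefficient of .810 is the Table 15 value for WCNS under Model 1 with 50 sentences, the observed-score variance of 0.0084 is our own sum of the Table 14 components assumed to enter that coefficient's denominator, the observed score of 0.20 is hypothetical, and the function names are ours. A normal-theory multiplier of 1.96 is assumed for a 95% interval.

```python
import math


def standard_error_of_measurement(observed_variance, g_coefficient):
    """SEM analogue: square root of observed variance times (1 - coefficient)."""
    return math.sqrt(observed_variance * (1.0 - g_coefficient))


def confidence_interval(score, sem, z=1.96):
    """Symmetric interval around an observed score (z = 1.96 for about 95%)."""
    return (score - z * sem, score + z * sem)


def se_difference(sem):
    """Standard error of the difference between two scores with equal SEM."""
    return sem * math.sqrt(2.0)


sem = standard_error_of_measurement(observed_variance=0.0084, g_coefficient=0.810)
interval = confidence_interval(0.20, sem)  # 0.20 is a hypothetical WCNS score
sed = se_difference(sem)

print(round(sem, 3), tuple(round(x, 3) for x in interval), round(sed, 3))
```

Under these assumptions, two scores from the same subject would need to differ by more than about 1.96 times the standard error of the difference before the change exceeded what measurement error alone would be expected to produce.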



Target Population 

Both the present study and that of Demorest and Bern- 
stein (1992) were conducted on adults with normal hearing. 
Individual differences in this population are of theoretical 
interest with respect to models of spoken language process- 
ing (Bernstein & Demorest, 1993). However, speechreading 
is most often tested in individuals with hearing impairment, 
either as a component of communication assessment in 
educational or rehabilitative settings or as part of the eval- 
uation of devices such as tactile aids or cochlear implants. 
Issues of test length, equivalence of alternate lists of words 
or sentences, comparability of results obtained from differ- 
ent talkers, and day-to-day variability in performance cannot 
be ignored in such contexts. Thus, it is important that 
models similar to those evaluated here be applied to data 
obtained from this target population. Such data are available 
in a study conducted by Bernstein, Demorest, and Tucker 
(1996) to directly compare the speechreading performance 
of populations with normal and impaired hearing. Further 
analyses of those data (Demorest, Bernstein, & Tucker, 
1996) will indicate the robustness of psychometric results 
reported here. 



Acknowledgments 

This research was supported by NIH Grant DC00695. The authors 
thank Kimberly Sayampanathan, Craig Grossman, Tracey Collins,
and Paula Tucker for their assistance in data collection and data 
processing. The paper benefited from thorough and thoughtful 
reviews, for which the authors are grateful. 

Descriptive statistics for individual items and analyses of variance 
for each type of material are available from the first author. 



References 

Allen, M. J., & Yen, W. M. (1979). Introduction to measurement
theory. Monterey, CA: Brooks/Cole.

Bench, J., & Bamford, J. (1979). Speech-hearing tests and the
spoken language of hearing-impaired children. London: Academic
Press.

Bernstein, L. E., & Demorest, M. E. (1993, November). A general
theory of speech perception must account for speech perception
without audition (lipreading/speechreading). Paper presented at
the 34th Annual Meeting of the Psychonomic Society, Washington,
DC.

Bernstein, L. E., Demorest, M. E., & Eberhardt, S. P. (1994). A
computational approach to analyzing sentential speech perception:
Phoneme-to-phoneme stimulus-response alignment. Journal
of the Acoustical Society of America, 95, 3617-3622.

Bernstein, L. E., Demorest, M. E., & Tucker, P. E. (1996). Speech 
perception without hearing. Manuscript submitted for publication. 

Bernstein, L. E., & Eberhardt, S. P. (1986a). Johns Hopkins
Lipreading Corpus I-II: Disc I [Videodisc]. Baltimore, MD: Johns
Hopkins University.

Bernstein, L. E., & Eberhardt, S. P. (1986b). Johns Hopkins
Lipreading Corpus III-IV: Disc II [Videodisc]. Baltimore, MD: Johns
Hopkins University.

Crocker, L., & Algina, J. (1986). An introduction to classical and
modern test theory. Fort Worth, TX: Harcourt Brace Jovanovich.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N.
(1972). The dependability of behavioral measurements: Theory of
generalizability for scores and profiles. New York: Wiley.

Davis, H., & Silverman, S. R. (Eds.). (1970). Hearing and deafness 
(3rd ed.). New York: Holt, Rinehart, & Winston. 

Demorest, M. E., & Bernstein, L. E. (1991). Computational explorations
of speechreading. Journal of the Academy of Rehabilitative
Audiology, 24, 97-111.

Demorest, M. E., & Bernstein, L. E. (1992). Sources of variability in
speechreading sentences: A generalizability analysis. Journal of
Speech and Hearing Research, 35, 876-891.

Demorest, M. E., & Bernstein, L. E. (1994, June). Relationships 
between confidence and performance in speechreading sen- 
tences. Paper presented at the meeting of the Academy of 
Rehabilitative Audiology, Salt Lake City, UT. 

Demorest, M. E., Bernstein, L. E., & Tucker, P. E. (1996). 
Generalizability of speechreading performance on nonsense syl- 
lables, words, and sentences: Subjects with hearing impairment. 
Unpublished manuscript. 

Duffy, S. A., & Pisoni, D. B. (1992). Comprehension of synthetic 
speech produced by rule: A review and theoretical interpretation. 
Language and Speech, 35, 351-389. 

Eberhardt, S. P., Bernstein, L. E., Demorest, M. E., & Goldstein, 
M. H., Jr. (1990). Lipreading sentences with single-channel vibro- 
tactile transformations of voice fundamental frequency. Journal of 
the Acoustical Society of America, 88, 1274-1285. 

Erber, N. P., & McMahan, D. A. (1976). Effects of sentence context 
on recognition of words through lipreading by deaf children. 
Journal of Speech and Hearing Research, 19, 112-119. 

Gesi, A. T., Massaro, D. W., & Cohen, M. M. (1992). Discovery and 
expository methods in teaching visual consonant and word iden- 
tification. Journal of Speech and Hearing Research, 35, 1180- 
1188. 



House, A. S., Williams, C. E., Hecker, M. H. L., & Kryter, K. D.
(1965). Articulation-testing methods: Consonantal differentiation
with a closed-response set. Journal of the Acoustical Society of
America, 37, 158-166.

Kreul, E. J., Nixon, J. C., Kryter, K. D., Bell, D. W., & Lamb, J. S.
(1968). A proposed clinical test of speech discrimination. Journal
of Speech and Hearing Research, 11, 536-552.

Marslen-Wilson, W., & Warren, P. (1994). Levels of perceptual 
representation and process in lexical access: Words, phonemes, 
and features. Psychological Review, 101, 653-675. 

Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of 
speech as a function of the context of the test materials. Journal 
of Experimental Psychology, 41, 329-335. 

Montgomery, A. A., & Jackson, P. L. (1983). Physical character- 
istics of the lips underlying vowel lipreading performance. Journal 
of the Acoustical Society of America, 73, 2134-2144. 

Owens, E., & Blazek, B. (1985). Visemes observed by hearing- 
impaired and normal-hearing adult viewers. Journal of Speech 
and Hearing Research, 28, 381-393. 

Parasnis, I., & Samar, V. J. (1982). Visual perception of verbal 
information by deaf people. In D. G. Sims, G. G. Walter, & R. L. 
Whitehead (Eds.), Deafness and communication: Assessment and 
training (pp. 53-71). Baltimore: Williams & Wilkins. 

Plant, G. L., & MacRae, J. H. (1981). The NAL lipreading test 
development, standardisation and validation. Australian Journal 
of Audiology, 3, 49-57. 

Rosen, S., & Corcoran, T. (1982). A video-recorded test of lipread- 
ing for British English. British Journal of Audiology, 16, 245-254. 

Stanovich, K. E. (1980). Toward an interactive-compensatory 
model of individual differences in the development of reading 
fluency. Reading Research Quarterly, 16, 32-71.

Utley, J. (1946a). A test of lip reading ability. Journal of Speech and
Hearing Disorders, 11, 109-116. 

Utley, J. (1946b). Development and standardization of a motion 
picture achievement test of lip reading ability. Unpublished doc- 
toral dissertation, Northwestern University, Evanston, IL. 

Walden, B. E., Erdman, S. A., Montgomery, A. A., Schwartz, 
D. M., & Prosek, R. A. (1981). Some effects of training on speech 
recognition by hearing-impaired adults. Journal of Speech and 
Hearing Research, 24, 207-216. 



Received July 30, 1995 
Accepted February 15, 1996 

Contact author: Marilyn E. Demorest, University of Maryland
Baltimore County, Department of Psychology, 5401 Wilkens
Avenue, Baltimore, MD 21228-5398. E-mail:
demorest@umbc2.umbc.edu



